mtmd: add batching API#24384
Conversation
|
Hi @ngxson, I just wanted to thank you for the time and patience you put into reviewing my PRs. I have learned a lot about llama.cpp in general, but especially mtmd, through that work. I would like to use that experience to help the team. If you would trust me with it, I would be glad to help with refactoring like #24384, and with the follow-up of migrating the existing models to the new batching API. The migration part especially feels like a good fit for what I have learned. No pressure either way — just tell me the shape you want and I will follow it. Also, related to this: I did some profiling on whether batching gives a significant speed gain, and on the GPU memory overhead, testing on an M3 Max and a few small Nvidia GPUs. On small consumer-grade GPUs the speed gain was not large. Happy to share the numbers if useful. |
|
@sfallah yes I'd appreciate if you can adapt the deepseek-ocr model similar to gemma4v.cpp in this PR some important notes:
|
yes, in fact both ds-ocr versions have bigger (1024x1024) overview images.
ds-ocr v2 doesn't have any issue with this, like most llava uhd style (tile slicing) models I guess. |
no I think I misunderstood my point: any llava-uhd style models slice the input image into multiple smaller (always square) image. for example, one big image can be sliced into 9 tiles. without batching, cgraph only need enough memory to hold 1 image in a single decode pass. batch of 9 images mean you now need 9x memory, which can be too large, so the batch will be conditioned by the number of output tokens; it's not the most reliable way, but roughly correct. so for example, the batch will be conditioned to max 6 tiles, that means the image will be processed in 2 batches: 6 + 3 the main point is that this limit is expected and should always be respected by all models |
|
@ngxson I have it almost ready, I will create a DRAFT PR so you can see it in the code. |
I don't see why we can't. you are assuming that all images in the batch must have the same number of output tokens, but that is not the case. the batching system is flexible such that images with different number of output tokens can be different for each image in the batch. that means even one image in the batch have newline and the rest doesn't have, there is no problem at all. assuming that a whole row need to be encoded will make the logic to be model-specific. there is always cases where you can absolutely fit multiple rows in the same batch (i.e. user simply allow larger batch) all you need to do is to insert the newline to the correct index in the output, it can be done simply by having a loop to concat view 3d output [n_embd, n_tokens, n_batch] as slices of [n_embd, n_tokens], then concat them back while inserting a newline conditionally and that even work if the batch is not row-aligned, for example output can be: [tile, tile, newline, tile, tile] |
That is how my first implementation of ds-ocr dynamic resolution worked, see
To be precise about the layout: a tile's tokens are not contiguous in the final output, so the loop has to interleave tile rows, not concat whole tiles. And that was exactly my problem: it is ugly, model-specific and lives in mtmd. "Insert the newline to the correct index" is the model-specific part; the loop that knows the indices has to live somewhere. So my question to you: where would you put this weaving/assembly so that it stays clean and model-agnostic? I have both variants working now (in-graph weave with row-aligned batches, flat tile batches with assembly-time weave); identical OCR output either way, including non-row-aligned splits. Draft PR coming so you can see it in code. |
|
please correct if I'm wrong, but let's take the non-batching version as the ground truth: for ds-ocr-v1: llama.cpp/tools/mtmd/models/deepseekocr.cpp Line 315 in ebc1077 the on the batched version, you can do that by simply concat the for v2: llama.cpp/tools/mtmd/models/deepseekocr2.cpp Lines 72 to 75 in ebc1077 the
why they aren't contiguous? IIUC output is [n_embd, n_tokens_per_image, n_batch], so they should follow the same order as the input |
This comment was marked as outdated.
This comment was marked as outdated.
|
I think we are mixing two different things here:
Because the HF reference rearranges the tile features into the full image grid before inserting the newlines, see modeling_deepseekocr.py: local_features = local_features.view(height_crop_num, width_crop_num, h2, w2, n_dim2).permute(0, 2, 1, 3, 4).reshape(height_crop_num*h2, width_crop_num*w2, n_dim2)
local_features = torch.cat(
[local_features, self.image_newline[None, None, :].expand(height_crop_num * h2, 1, n_dim2)], dim=1
)
...
global_local_features = torch.cat([local_features, global_features, self.view_seperator[None, :]], dim=0)The I am doing the same in ggml in my batched-encode branch (the #24300 one), see: cur = ggml_reshape_4d(ctx0, cur, n_dim * tile_w, tile_w, grid_x, grid_y); // [n_dim*tile_w, tile_w, grid_x, grid_y]
cur = ggml_cont(ctx0, ggml_permute(ctx0, cur, 0, 2, 1, 3));
...
nl = ggml_repeat_4d(ctx0, model.image_newline, n_dim, 1, gh, 1);
cur = ggml_reshape_3d(ctx0, cur, n_dim, gw, gh); //[n_dim, gw, gh]
cur = ggml_concat(ctx0, cur, nl, 1);Also, master cannot be the ground truth for multi-tile: master's ds-ocr v1 is single-view only, the multi-tile path is what #24300 added. The unconditional For v2 I agree on the layout: tiles have no newline weave, their raw concatenation is already the final output, and My implementation of exactly this layout scores CER 0.0000 against the HF output on the multi-tile eval -- with a tile-contiguous layout that match would not be possible. |
tbh I'm not a fan of introducing 2 changes in one PR. this is the exact root cause of the miscommunication in the past N message between us. if you have 2 different changes, please make it very clear. what I understand is that there are 2 different subjects:
for v1-multi-tile, please push a PR without batching support first. I will not proceed until I understand what it does. for v2-batching, it should be the same case as existing llava-uhd model --> no siginificant problem, right? |
Correct on both.
Agreed on splitting the PRs. |
|
BTW the PR that you closed rather abruptly (#24300) already included functionally everything for a proper DSOCR v1+v2 dynamic-resolution multi-tile batched encoding of tiles in parity with HF reference impls -- carefully crafted, with a solid regression test, perf testing and profiling. The non-batched PR you are asking for is essentially my dyn-res branch (https://github.com/sfallah/llama.cpp/tree/sf/deepseek-ocr-mul-tile-dyn-res); the batched layer on top is exactly what #24300 did. But I understand your point about splitting the PRs, so I will do that. I just want to make sure we are on the same page about the content of each PR and the implications of the different approaches. Concretely: method 1 (the in-graph weave) needs the whole grid in one graph, so the non-batched first PR will follow the dyn-res approach; the in-graph weave then belongs to the batched layer. |
I might be a bit hard here, but the value of open source contribution is not only about "the code works", but also about planning and communication. you cannot expect pushing a large PR that contains multiple (unrelated) changes and having someone else to fully understand it, that's not how code review work. your own comment #24300 (comment) also pointed out independent changes that can be split to smaller PR, why don't we do that instead? not necessarily 4 separate PRs, but you get the idea. I did acknowledge the first one #24352 and as a proof: the review for that PR was straight-forward.
I don't quite understand your intent here, but to make it clear: I expect the first version that simply doesn't use the 4th dim (n_batch); that dim should always be 1 in the cgraph I imagine such change will affect just 2 places:
|
|
merging this change after CI passes, the tests.sh is also passed: |
Overview
Supersede #24300
Also fix #24380
Add a generic batching API to mtmd and wire it up to
llama-server, the goal is to speed up llava-uhd-style models and at the same time, improve video processing speedCurrent state:
llama-servercan use it correctlyTODO:
mtmd_batch_add_chunkshould only accept input with same sizemtmd_batch_encodeto use the 4th batch dim, added via mtmd: build_vit batching #24352build_vit()models for nowmtmd-clito use batching API --> skip, we don't actually need thatHow it works
mtmd_batchobjectmtmd_batch_add_chunkuntil it returns an error (either batch is full or current chunk can't be batched)mtmd_batch_encodeon the batchmtmd_batch_get_output_embdRequirements